-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove needless last_root for better reclaims #8148
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8148 +/- ##
========================================
- Coverage 82% 82% -0.1%
========================================
Files 249 249
Lines 53530 53503 -27
========================================
- Hits 43913 43887 -26
+ Misses 9617 9616 -1 |
Seems right to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome detective work as always @ryoqun! Any chance you could write a test for this?
Sad status report: It seems the description is wrong. I can't create a valid test demonstrating the described error condition... And I cannot find another alternative bad scenario at the moment.... This bug is hard. ;) |
4132eb6
to
5cdec64
Compare
5cdec64
to
3120504
Compare
@@ -1156,9 +1153,7 @@ impl AccountsDB { | |||
dead_slots | |||
} | |||
|
|||
fn cleanup_dead_slots(&self, dead_slots: &mut HashSet<Slot>, last_root: u64) { | |||
// a slot is not totally dead until it is older than the root | |||
dead_slots.retain(|slot| *slot < last_root); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized all inputs of this function is ensured to be older or equal to last_root
.
Notice the or equal to
part, this assertion must be *slot <= last_root
for purge_zero_lamport_accounts
. But, we don't need this to begin with, as said above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, why are they guaranteed to be older than last_root?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can explain this by reading the code from here upwards the call graph.
Firstly, this cleanup_dead_slots(dead_slots, last_root)
is only called from handle_reclaims(reclaims)
.
dead_slots
are derived from reclaims
by calling remove_dead_accounts()
inside handle_reclaims()
, which does just simple 1-to-1 storage -> slot mapping. Thus, we just need to explain in terms of reclaims
further on.
handle_reclaims()
are called only from purge_zero_lamport_accounts()
(1) and store_with_hashes()
(2). These are the only 2 code paths reaching to this function.
(1) For the purge_zero_lamport_accounts()
cod epath, reasoning is simple:
The reclaims
is a collection of results of accounts_index.purge(&pubkey)
, which in turn the sum of get_rooted_entries()
for all such zero lamport pubkey
s. So, reclaims
(thus, dead_slots
) are always rooted slots only. This is true for normal operation because the validator calls add_root()
as it roots slots.
Also this is also true for snapshot restoration because generate_index()
is called before purge_zero_lamport_accounts()
and generate_index()
does accounts_index.roots.insert(*slot_id)
.
(2) For the store_with_hashes()
, reasoning is a bit longer:
The reclaims
is ultimately only generated at accounts_index.update()
. There is intermediately fn
s like accounts_db.update_index()
and accounts_index.insert()
, but the destination of this call graph is same.
Next, accounts_index.update(slot, ...)
creates reclaims
by concatenating two sets:
A: can_purge
-able (= slot
< max_root
) entries
B: Old entries in slot
by replacing with updated one
A is obviously fine because max_root
is always member of accounts_index.roots
. B is ensured not to be dead
at remove_dead_accounts()
because the case B is always inserting an updated alive entry into the slot
.
So, as a normal operation, cleanup_dead_slots()
never removes anything such that slot > last_root
.
And leaving this line's retain
causes halfway-deallocated state because cleanup_dead_slots()
is only called once ever for each given slot. That's because later invocations of remove_dead_accounts()
would remove any slots whose storage are already removed.
Thanks for reading a rather long comment. :)
So I think this retain
does add little value and complicates code reading with seemingly overlapping and spread responsibility for ensuring purge
-able slots at caller-site or at callee-site.
As a second thought, if you're a bit concerned for future changes that might break these assumptions, this PR can be changed to assert!
that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Thanks for the description. I think it can be removed.
@@ -161,11 +154,6 @@ impl<T: Clone> AccountsIndex<T> { | |||
} | |||
|
|||
pub fn add_root(&mut self, slot: Slot) { | |||
assert!( | |||
(self.last_root == 0 && slot == 0) || (slot >= self.last_root), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think removing this assertion is ok to remove last_root
, first of all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sakridge is the best person to approve this PR, he's more familiar with the details in this part of the codebase than I
Removing the |
@sakridge Thanks for checking this! I've just merged! |
This got needed in the v0.23 branch for #8436. |
* Restore last_root to fix unintended storage delete * Remove last_root thing altogether * Remove unneeded test... (cherry picked from commit 5872746)
Problem
last_root wasn't correctly restored from snapshot and its related range check is bad for
purge_zero_lamport_account
, which ultimately causes dangling storage entries some times when restoring from snapshot.(a lot of original investigation report for #8130 is moved to new PR: #8176)
Summary of changes
After careful code scrutinization, I'm pretty sure the
last_root
thing can go away.Just remove
last_root
Background
I found this as part of investigation for #8130.
Part of #7167.